Deriving the Hat Matrix and QR Decomposition

Dr. Lucy D’Agostino McGowan

Deriving the Hat Matrix

The Sum of Squared Errors

Recall our objective: \[\text{SSE} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

Goal: Find \(\hat{\boldsymbol{\beta}}\) that minimizes SSE

Method: Take the derivative with respect to \(\boldsymbol{\beta}\) and set equal to zero

Expanding the SSE Expression

Start with: \[\text{SSE} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

Expand: \[\text{SSE} = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\]

Expanding the SSE Expression

Since \(\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}\) (both are scalars): \[\text{SSE} = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\]

Taking the Derivative

Matrix calculus rules we need:

  • \(\frac{\partial}{\partial \boldsymbol{\beta}}(\mathbf{a}^T\boldsymbol{\beta}) = \mathbf{a}\)
  • \(\frac{\partial}{\partial \boldsymbol{\beta}}(\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta}) = 2\mathbf{A}\boldsymbol{\beta}\) (when \(\mathbf{A}\) is symmetric)

Taking the derivative: \[\frac{\partial \text{SSE}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\]

Setting Equal to Zero

Set the derivative equal to zero: \[-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\]

Divide by 2: \[-\mathbf{X}^T\mathbf{y} + \mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\]

Setting Equal to Zero

Rearrange: \[\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}\]

The Least Squares Solution

Solve for \(\hat{\boldsymbol{\beta}}\): \[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

Verification in R

set.seed(1)
x <- 1:10
y <- 2 + 3 * x + rnorm(10, 0, 2)

# Design matrix with an intercept column
X <- cbind(1, x)

# Normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Compare with lm()
model <- lm(y ~ x)
cbind(beta_hat, coef(model))
      [,1]     [,2]
  1.662353 1.662353
x 3.109464 3.109464

QR Decomposition

What is QR Decomposition?

Any matrix \(\mathbf{X}\) can be decomposed as: \[\mathbf{X} = \mathbf{Q}\mathbf{R}\]

Where:

  • \(\mathbf{Q}\) has orthonormal columns: \(\mathbf{Q}^T\mathbf{Q} = \mathbf{I}\)
  • \(\mathbf{R}\) is upper triangular

What is QR Decomposition?

Any matrix \(\mathbf{X}\) can be decomposed as: \[\mathbf{X} = \mathbf{Q}\mathbf{R}\]

In words: Break down \(\mathbf{X}\) into perpendicular (orthonormal) directions (\(\mathbf{Q}\)) and the upper-triangular coefficients that combine them (\(\mathbf{R}\))
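These properties are easy to check numerically. A minimal sketch in base R, using a small hypothetical design matrix:

```r
# Small hypothetical design matrix: intercept, linear, quadratic terms
X <- cbind(1, 1:5, (1:5)^2)

qr_decomp <- qr(X)
Q <- qr.Q(qr_decomp)  # columns are orthonormal
R <- qr.R(qr_decomp)  # upper triangular

round(t(Q) %*% Q, 10)                            # identity matrix
all.equal(Q %*% R, X, check.attributes = FALSE)  # TRUE: QR reconstructs X
```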

Why Use QR for Regression?

Traditional approach requires: \[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

Why Use QR for Regression?

Problems with \((\mathbf{X}^T\mathbf{X})^{-1}\):

  • Computing \(\mathbf{X}^T\mathbf{X}\) can be numerically unstable
  • Matrix inversion is computationally expensive
  • Condition number gets squared: \(\kappa(\mathbf{X}^T\mathbf{X}) = \kappa(\mathbf{X})^2\)

Demo

What do I mean by “unstable”?

# Polynomial design: the columns span wildly different scales
x <- seq(1, 500, len = 30)
X <- cbind(1, x, x^2, x^3)
beta <- matrix(c(1, 1, 1, 1), 4, 1)

set.seed(1)
y <- X %*% beta + rnorm(30)
solve(crossprod(X))  # attempt to invert X'X

Error in solve.default(crossprod(X)) : system is computationally singular: reciprocal condition number = 3.11914e-17

log10(crossprod(X))
                   x                    
  1.477121  3.875929  6.406189  8.987506
x 3.875929  6.406189  8.987506 11.596810
  6.406189  8.987506 11.596810 14.223800
  8.987506 11.596810 14.223800 16.862984

The entries of \(\mathbf{X}^T\mathbf{X}\) span roughly 15 orders of magnitude, so inverting it runs up against floating-point precision.

QR Solution

If \(\mathbf{X} = \mathbf{Q}\mathbf{R}\), then: \[\mathbf{X}^T\mathbf{X} = (\mathbf{Q}\mathbf{R})^T(\mathbf{Q}\mathbf{R}) = \mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R} = \mathbf{R}^T\mathbf{R}\]
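This identity can be verified directly in R (a quick sketch with simulated data):

```r
set.seed(1)
X <- cbind(1, rnorm(20), rnorm(20))  # simulated design matrix
R <- qr.R(qr(X))

# crossprod(A) computes A'A; the two agree up to floating-point error
all.equal(crossprod(X), crossprod(R), check.attributes = FALSE)  # TRUE
```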

QR Solution

So: \[ \begin{align} \hat{\boldsymbol{\beta}} &= (\mathbf{R}^T\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\mathbf{y} \\&= \mathbf{R}^{-1}(\mathbf{R}^T)^{-1}\mathbf{R}^T\mathbf{Q}^T\mathbf{y}\\& = \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} \end{align} \]

QR Advantages

  • Only need to solve \(\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{Q}^T\mathbf{y}\) (back substitution)
  • More numerically stable
  • Avoids explicit matrix inversion

You Try: QR Properties

Given \(\mathbf{X} = \mathbf{Q}\mathbf{R}\), verify that:

  1. \(\mathbf{X}^T\mathbf{X} = \mathbf{R}^T\mathbf{R}\)
  2. The hat matrix becomes \(\mathbf{H} = \mathbf{Q}\mathbf{Q}^T\)

Hint: Use the fact that \(\mathbf{Q}^T\mathbf{Q} = \mathbf{I}\)


Backsolve Example

Given: \[\mathbf{R} = \begin{bmatrix} 2 & 1 & 3 \\ 0 & 4 & 2 \\ 0 & 0 & 5 \end{bmatrix}, \quad \mathbf{Q}^T\mathbf{y} = \begin{bmatrix} 8 \\ 12 \\ 15 \end{bmatrix}\]

Solve: \(\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{Q}^T\mathbf{y}\)

Step-by-Step Backsolve

\[\begin{bmatrix} 2 & 1 & 3 \\ 0 & 4 & 2 \\ 0 & 0 & 5 \end{bmatrix} \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \hat\beta_3 \end{bmatrix} = \begin{bmatrix} 8 \\ 12 \\ 15 \end{bmatrix}\]

Step 1: Solve bottom row first \[5\hat\beta_3 = 15 \implies \hat\beta_3 = 3\]

Step-by-Step Backsolve

\[\begin{bmatrix} 2 & 1 & 3 \\ 0 & 4 & 2 \\ 0 & 0 & 5 \end{bmatrix} \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \hat\beta_3 \end{bmatrix} = \begin{bmatrix} 8 \\ 12 \\ 15 \end{bmatrix}\]

Step 2: Substitute into second row \[4\hat\beta_2 + 2(3) = 12 \implies 4\hat\beta_2 = 6 \implies \hat\beta_2 = 1.5\]

Step-by-Step Backsolve

\[\begin{bmatrix} 2 & 1 & 3 \\ 0 & 4 & 2 \\ 0 & 0 & 5 \end{bmatrix} \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \hat\beta_3 \end{bmatrix} = \begin{bmatrix} 8 \\ 12 \\ 15 \end{bmatrix}\]

Step 3: Substitute into first row \[2\hat\beta_1 + 1(1.5) + 3(3) = 8 \implies 2\hat\beta_1 = -2.5 \implies \hat\beta_1 = -1.25\]

Backsolve Solution

\[\hat{\boldsymbol{\beta}} = \begin{bmatrix} -1.25 \\ 1.5 \\ 3 \end{bmatrix}\]

Key insight: Work backwards through the triangular matrix, using previously solved values to eliminate unknowns in each step.
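The same calculation can be done in one call with R's backsolve():

```r
# Upper triangular R from the worked example (matrix() fills column-wise)
R <- matrix(c(2, 0, 0,
              1, 4, 0,
              3, 2, 5), nrow = 3)
Qty <- c(8, 12, 15)

backsolve(R, Qty)
# [1] -1.25  1.50  3.00
```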

QR in R

# Using our previous data

# QR decomposition
qr_decomp <- qr(X)
Q <- qr.Q(qr_decomp)
R <- qr.R(qr_decomp)

# Solve using QR
beta_qr <- backsolve(R, t(Q) %*% y)
beta_qr

Why QR Matters

Condition number comparison:

# Compare condition numbers
kappa(X)               # Condition of X
[1] 158915720
kappa(t(X) %*% X)      # Condition of X'X (much worse!)
[1] 3.165203e+16

You Try: QR Properties

Given \(\mathbf{H} = \mathbf{Q}\mathbf{Q}^T\), calculate \(\hat{\mathbf{y}}\) using the QR elements.


Interpretation for Linear Regression

What Does SSE Tell Us?

  • Measures total unexplained variation
  • Smaller SSE = better fit
  • Units are (units of y)²

SSE alone isn’t enough

  • Depends on sample size and scale of y
  • Need context for interpretation

From SSE to More Useful Metrics

Mean Squared Error (MSE): \[\text{MSE} = \frac{\text{SSE}}{n-p} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-p}\]

Why \((n-p)\) instead of \(n\)?

  • \(p\) = number of parameters estimated
  • Corrects for degrees of freedom used (provides unbiased estimate of error variance)
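As a check, the MSE computed by hand matches lm's residual variance estimate. A sketch reusing the simulated data from earlier:

```r
set.seed(1)
x <- 1:10
y <- 2 + 3 * x + rnorm(10, 0, 2)
model <- lm(y ~ x)

n <- length(y)
p <- length(coef(model))              # p = 2: intercept and slope
mse <- sum(resid(model)^2) / (n - p)  # SSE / (n - p)

# summary()$sigma is the residual standard error, sqrt(SSE / (n - p))
all.equal(mse, summary(model)$sigma^2)  # TRUE
```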

Interpreting Coefficients

Simple Linear Regression: \(y = \beta_0 + \beta_1 x + \varepsilon\)

  • \(\beta_0\) (intercept): Expected value of \(y\) when \(x = 0\)
  • \(\beta_1\) (slope): Expected change in \(y\) for a 1-unit increase in \(x\)

Interpreting Coefficients

Multiple Regression: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\)

  • \(\beta_1\): Expected change in \(y\) for a 1-unit increase in \(x_1\), holding \(x_2\) constant
  • \(\beta_2\): Expected change in \(y\) for a 1-unit increase in \(x_2\), holding \(x_1\) constant
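One way to see the “holding constant” interpretation is with simulated data (the variable names and coefficients here are hypothetical):

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(100)

model <- lm(y ~ x1 + x2)

# The fitted coefficient on x1 equals the predicted change in y when
# x1 increases by 1 unit while x2 is held fixed (at 0 here)
d <- predict(model, newdata = data.frame(x1 = 1, x2 = 0)) -
     predict(model, newdata = data.frame(x1 = 0, x2 = 0))
all.equal(unname(d), unname(coef(model)["x1"]))  # TRUE
```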

You Try: Coefficient Interpretation

Given the regression: \(\text{Salary} = 40000 + 2000 \times \text{YearsExperience} + 5000 \times \text{HasDegree}\)

Where HasDegree = 1 if person has degree, 0 otherwise

Interpret each coefficient:

  1. What does 40000 represent?
  2. What does 2000 represent?
  3. What does 5000 represent?

Practical Exercise

  • Log in to RStudio Pro.
  • Using the teengamb dataset from the faraway package, predict gambling expenditure using sex, status, and income.
  • Do this 3 ways:
  1. using the closed-form solution we derived by minimizing the SSE
  2. using QR decomposition
  3. using the lm function